The Polygraph Place



  Polygraph Place Bulletin Board
  Professional Issues - Private Forum for Examiners ONLY
  as promised - more math - single issue scoring

Author Topic:   as promised - more math - single issue scoring
rnelson (Member) posted 08-28-2006 04:57 PM
mixed issues tests (not single issue tests as the topic suggests)

A proposed statistical model for ipsative evaluation of the significance of score in mixed issues examinations.

The venerable hand-scoring methods using the 7-position and 3-position scales are easily taught, offer satisfactory reliability, and have withstood scrutiny in countless decision accuracy studies. However, these methods do not lend themselves well to the tabulation of the inferential statistics commonly used to describe outcome probabilities. Conversely, common statistical models offer an overburden of complexity for field testing situations. Fortunately, computers offer the ability both to automate complex procedures and to perform complex calculations with assumed perfect reliability, though reliability should not itself be mistaken for validity or accuracy.

While there are common statistical models, such as standard errors of mean differences, hypothesis tests, and z-tests, that can be employed to evaluate the significance of the difference between mean scores of comparison and relevant question datasets, no application of a commonly recognized statistical model has been suggested for scoring examinations constructed around mixed issues (i.e., when it is logically conceivable that an examinee could lie to one relevant question while being truthful to another).

This is a problem for PCSOT programs, which commonly investigate mixed issues, and for which there is a troublesome impulse to misapply the existing scoring algorithms.

In normative testing circumstances, it is common to evaluate individual scores using the z-score method, which returns the number of standard deviations by which an individual score differs from the mean score of a sample or population. The resulting standard score can be used to estimate the proportion of the sample or population scoring above or below the observed individual score.
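
As a rough illustration of the z-score idea, here is a minimal sketch in Python (the numbers are hypothetical, not polygraph norms):

```python
# z-score sketch with hypothetical normative values
from statistics import NormalDist

population_mean = 100.0  # hypothetical normative mean
population_sd = 15.0     # hypothetical normative standard deviation
observed = 118.0         # hypothetical individual score

z = (observed - population_mean) / population_sd
below = NormalDist().cdf(z)  # proportion of the population scoring lower
print(f"z = {z:.2f}; about {below:.1%} of the population scores below {observed}")
```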

Comparison question polygraph testing has long employed an ipsative scoring paradigm in which responses to relevant questions are not compared to a normative data model, but to the individual's own response profile via reactions to PLC or DLC comparison questions.

The most obvious limitation to the employment of common statistical models is the small sample size inherent in polygraph question sets: usually two to four relevant questions along with three to four comparison questions, across three to five charts, resulting in datasets of approximately six to 20 data-points for the relevant and comparison subsets, and approximately 12 to 40 data-points for pooled relevant and comparison question sets. These small sample sizes generally mean that preferred statistical models involving the standard normal (z) distribution cannot be employed. Instead, small samples are generally evaluated using the t-distribution with n-1 degrees of freedom. There are known models for evaluating the confidence interval surrounding the mean of a small sample.

To calculate the confidence interval of a small-sample mean, using the t-distribution at a specified alpha with n-1 degrees of freedom (read as: t sub-alpha with n minus one degrees of freedom)...

first

calculate the mean and standard deviation for all iterations of all comparison questions (N = the number of comparison questions multiplied by the number of charts)

it is necessary to have access to a t-distribution table to find the t value,

so, for anyone who needs a t-table and doesn't have one, here is a .csv worksheet

http://www.raymondnelson.us/training/ttable.csv

to calculate the confidence interval use...

t-sub-alpha with n-1 degrees of freedom * (st dev / sqrt(n))

then, to find the lower limit of the confidence range, use the mean score minus the confidence interval...

X-bar - [t-sub-alpha with n-1 degrees of freedom * (st dev / sqrt(n))]

and for the upper limit of the confidence interval, the mean score plus the confidence interval...

X-bar + [t-sub-alpha with n-1 degrees of freedom * (st dev / sqrt(n))]

here is a graphic (poor quality)
http://www.raymondnelson.us/training/confidence_interval.jpg

These lower and upper limits tell us, based upon the sample of comparison question response values, the (1-alpha) x 100 percent confidence range in which we can expect the comparison mean response value to lie if we were to conduct the experiment over and over.
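
For anyone who wants to check the arithmetic, here is a minimal sketch of the same confidence interval calculation in Python. The comparison scores are hypothetical, and scipy is assumed for the t-distribution:

```python
# Small-sample confidence interval for the comparison question mean
import math
import statistics
from scipy import stats

# hypothetical rank scores: 3 comparison questions x 3 charts (N = 9)
comparison_scores = [4, 6, 5, 3, 5, 4, 6, 5, 4]
alpha = 0.05

n = len(comparison_scores)
mean = statistics.mean(comparison_scores)
sd = statistics.stdev(comparison_scores)       # sample st dev (n-1 denominator)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # two-tailed t value

interval = t_crit * sd / math.sqrt(n)
print(f"{1 - alpha:.0%} confidence range: "
      f"{mean - interval:.2f} to {mean + interval:.2f}")
```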

Relevant question response values that lie outside that range (above or below) can be expected to be distinct from, or unequal to, the comparison mean score suggested by the confidence range at the level specified by alpha, and can therefore be regarded as indicative of the presence or absence of physiological reactions known to be correlated with deception when compared with response values produced by DLCs (I suppose this could also work with PLCs).

What this really means is that there is not more than an alpha x 100 percent chance that an observed relevant score lying outside the (1-alpha) x 100 percent confidence range is equal to the mean score for the comparison questions.

A more common method of evaluating the significance of individual relevant question scores would be to use the scored value of each individual relevant question (aggregated across all charts) as the point estimate in a hypothesis test for small samples, using the mean and variance scores for all iterations of all comparison questions from all test charts. Here, aggregated values treat the multiple iterations of each question, across multiple test charts, as a single value without any variability, as opposed to treating each iteration of each question as a separate data-point, which would introduce variability to the data. Using a small-sample hypothesis test of a point estimate against the comparison mean scores solves a very important and under-discussed empirical complication underlying comparison question polygraphy: measuring the statistically significant absence of something, which can only be accomplished through comparison with something else (an expected response value).
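
Here is a minimal sketch of that point-estimate hypothesis test in Python, my illustration of the idea rather than the spreadsheet itself; the scores and the R1 aggregate are hypothetical, and scipy is assumed:

```python
# Point-estimate hypothesis test: one relevant question against the
# distribution of all comparison question iterations
import math
import statistics
from scipy import stats

comparison_scores = [4, 6, 5, 3, 5, 4, 6, 5, 4]  # all CQ iterations, all charts
r1_aggregate = 2.0  # hypothetical aggregated score for relevant question R1
alpha = 0.05

n = len(comparison_scores)
mean = statistics.mean(comparison_scores)
sd = statistics.stdev(comparison_scores)

t_stat = (r1_aggregate - mean) / (sd / math.sqrt(n))
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-tailed p-value

verdict = "significant" if p_value < alpha else "not significant"
print(f"t = {t_stat:.2f}, p = {p_value:.4f} ({verdict} at alpha {alpha})")
```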

Miritello (1999) offered a set of guidelines for tabulation of response scores for both comparison and relevant questions, based on rank order assignment of each component parameter.

Using Miritello's procedures, the suggested statistical equations can tabulate the significance of individual relevant questions at a specified level of alpha, depending on the desired level of confidence, tolerance for risk, and cost of errors.

One disagreement I have with Miritello is that she reduces ranked values to decimal proportions of the maximum possible response scores from all charts within an examination. I believe this is a mistake, as it extinguishes variability that is necessary for some statistical calculations and masks the fact that the data were obtained in separate iterations of each question stimulus. It would be generally more correct to regard each iteration of each question (on multiple charts) as a separate data point. This preserves useful variability information, necessary for some statistical estimations, and more accurately reflects the size of the dataset. In other words, Miritello's procedures suggest that N = the number of scored questions, regardless of the number of iterations or test charts, while I suggest that N = the number of scored questions multiplied by the number of iterations or test charts. In a test format with three comparison questions and three test charts, this represents the difference between N=3 and N=9, which has an important influence on the viability of the t value obtained when using n-1 degrees of freedom.
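
The practical effect of N=3 versus N=9 is easy to see in the critical t values (a quick check, with scipy assumed):

```python
# Critical t values at alpha .05 (two-tailed) for N=3 vs N=9
from scipy import stats

for n in (3, 9):
    t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)
    print(f"N={n} (df={n - 1}): t critical = {t_crit:.2f}")
# df=2 requires about 4.30; df=8 requires only about 2.31
```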

However, variability is not desired when setting a point estimate, such as when evaluating the significance of each individual relevant question, so I agree with Miritello's procedure of aggregating response scores for individual (mixed) relevant questions across all charts in mixed issues tests.

I'm also a little bothered by Miritello's use of the logical rank of zero as the lowest rank order (with the corresponding highest rank within each chart as the number of scored questions minus one). Rank order (ordinal) data schemes do not generally use zero as a rank label, as rank labels are logical and ordinal; there is no rank named zero; the first rank is the first rank. Employing zero here tends to give the false impression that the data might represent a ratio data scheme (which would employ a zero value). I don't believe this affects the actual math much, though it may exaggerate the variability slightly; it is a point of conceptual clarity.

Now that this is out of my head, I can concentrate on some work.

r

------------------
"Gentlemen, you can't fight in here, this is the war room."
--(Peter Sellers as President Merkin Muffley in Stanley Kubrick's Dr. Strangelove, 1964)


stat (Member) posted 08-29-2006 09:04 PM
Ray, there are some wonderful new books out there which deal specifically with Asperger's autism.------HA HA HA
You're a gift to the profession man.


polypro (Member) posted 08-30-2006 01:12 PM
rnelson,

Maybe I've missed something, so please clarify. Are you suggesting an algorithm that would basically assign a weighted value to each relevant in the question string, and ultimately a call at each spot?


rnelson (Member) posted 08-30-2006 02:30 PM
Yes, essentially.

Miritello (1999) offered a procedure for calculating a decimal proportion for responses to both relevant and comparison questions, using a rank order procedure. However, she never took it the final step, and never offered any suggestions for decision thresholds or decision guidelines pertaining to those decimal proportions.
http://www.raymondnelson.us/qc/Miritello_1999_rank_order_analysis.pdf

The fact that she calculates response values for both relevant and comparison questions allows the use of common (to statisticians and researchers) inferential models for small samples - using standard errors, significance tests, and t-distributions. Most existing computer algorithms use "proprietary" math - a bothersome issue and a serious impediment to "validating" those models.

Ever have to sit in court and explain how those figures were derived??? Ever have to explain how there can be a probability of something greater than 100%???

Also, existing models have not adequately addressed the challenge of scoring individual relevant questions in mixed issues tests. This suggestion does just that.

I have created a spreadsheet to do the math quickly, once the rank values are tabulated (automatic with the Limestone system). By this weekend (I have to get it done quickly before they change my meds - joke), it will also compute scores based on raw data values in addition to the Miritello rank proportions.

This algorithm does not use normed data, but employs inferential tests of probability in an ipsative manner - using data from the comparison questions of each exam.

So, now I need a dataset.

r



polypro (Member) posted 08-30-2006 03:33 PM
Any chance of doing the same for other formats - like probing exams or R/I?


rnelson (Member) posted 08-30-2006 03:42 PM
I don't know why not; it's just time spent thinking about statistics - which can make the brain hurt a little.

All we are really saying is that "these-here scores are not those-there scores" at some statistically significant, measurable, and repeatable level. Other work will have to demonstrate that those differences are correlated with deception and truth-telling.

One thing that seems important is to apply models in ways that are theoretically cogent. For example, it doesn't exactly make sense to simply apply Polyscore to an R/I test - it wasn't built to do that, and there is inadequate theoretical rationale to explain why it would work or what the results would mean. On the other hand, some tests are developed atheoretically, and if they work, then they work.

We have yet to see if my suggestion works.

r


polypro (Member) posted 08-30-2006 03:49 PM
Keep going man! Truth's in the science, brother. That's great.


stat (Member) posted 08-30-2006 08:19 PM
hmmm ....Interesting


rnelson (Member) posted 08-31-2006 11:44 AM
How's about a break from all that mind-bending wizard-of-oz degree stuff, heh?

Here are a couple of screenshots of a spreadsheet to score single issue tests using Miritello rank values and common hypothesis t-tests using the t-distribution for small samples.
http://www.raymondnelson.us/qc/single.jpg

and the same data using confidence intervals to evaluate each relevant question, with the interval specified by a desired alpha.
http://www.raymondnelson.us/qc/mixed.jpg

spreadsheets can easily be programmed to default individual questions to INC/NO when any single question or combination of questions produces a DI value.

This spreadsheet allows the input of a specified alpha (essentially the false negative rate).

In this example you can see the data are evaluated as NSR at an alpha of .1 (which in a two-tailed t-test gives a 95% likelihood that the comparison mean is actually greater than the relevant mean score).
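
To see why a two-tailed alpha of .1 corresponds to 95% confidence on one side, compare the critical values (a quick check with scipy; the df of 8 is just a hypothetical example):

```python
# A two-tailed test at alpha .10 and a one-tailed test at alpha .05
# share the same critical t value
from scipy import stats

df = 8  # hypothetical degrees of freedom
print(stats.t.ppf(1 - 0.10 / 2, df))  # two-tailed at alpha .10: about 1.86
print(stats.t.ppf(1 - 0.05, df))      # one-tailed at alpha .05: about 1.86
```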

Scoring each question individually returns an NSR result for each question at the .05 level.

I think it's interesting to be able to calculate the results at varying levels of alpha. If I can figure out how to get my spreadsheet to calculate a p-value, that would tell us the lowest level of alpha at which the result could be regarded as significant - otherwise it's back to SPSS.

The data are from a recent exam for which I have a subpoena to talk to a judge at a motions hearing. I think I'm supposed to talk mostly about polygraph accuracy and error issues, not this test - which I did not conduct. Polyscore and Identify returned NDI results at greater than 99 percent.

The hand-score was, of course, NDI.

I have a couple of DI confessions this week to plug in next.

The point of this exercise was simply to determine whether our ipsative data schemes are sufficiently robust to establish significance using common statistical models. Of course, ipsative data can be normed, and these ideas should be investigated using a dataset.

enjoy,

r



rnelson (Member) posted 09-01-2006 09:52 AM
More ranting about rank scoring.

this topic should say mixed issue scoring, not single issue scoring - perhaps a moderator could fix that.

I've already ranted about Miritello's (1999) use of a rank value of zero - which I believe causes subtle, but possibly important, exaggeration of variability in the data. With a small data set this may be more important.

Another gripe is her use of shared rank values, which is somewhat antithetical to rank scoring. Rank scoring stabilizes data inconsistencies by emphasizing logical, ordinal values without the assumptions of zero or ratios; sharing rank values lends those assumptions again. Most mathematical ranking schemes will, I believe, have any shared values share the lower rank.

For example: a classroom full of 2nd grade students are ages 6, 7, 8, and 9. Four students are age 6, 12 students are age 7, three are age 8, and one is age 9. There are 20 students, but only four ranks. Ranks are whole numbers; there is no 3.5 rank.
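
For what it's worth, standard ranking routines make the same distinction. Here is a quick look (hypothetical scores) at "min" ranking, which matches the shared-lower-rank convention, versus "average" ranking, which produces the fractional 3.5-type ranks I'm objecting to:

```python
# Tie handling in rank assignment
from scipy.stats import rankdata

scores = [6, 7, 7, 8, 6, 9]
print(rankdata(scores, method="min"))      # ties share the lower rank: whole numbers
print(rankdata(scores, method="average"))  # ties split the rank: fractional values
```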

Overall though, I like Miritello's procedure.

I've prepared a spreadsheet and sent it to Limestone, who are already incorporating Miritello's rank procedures, using, I think, the same robust Kircher features employed by OSS. (I've nudged them about these rank issues.) I could probably be talked into disseminating it to others here if there is interest.

r


J L Ogilvie (Moderator) posted 09-01-2006 05:13 PM
Not to throw a wrench in the numbers, but wouldn't all the tests have to be performed pretty much the same way by every examiner? I would think that for the statistics to work, an examiner would have to have a level of competency to make the numbers hold up.

We all know not everyone does a good test.

Having said that, I now leave on vacation for a week or so. More later.

Jack


rnelson (Member) posted 09-02-2006 12:26 AM

coupl'a things Jack,

I think you may be misunderstanding how ipsative scoring systems work. They differ from normative scoring in that normative scoring compares an individual observation (score) to aggregated or normed data from others, like IQ tests and personality tests. Ipsative scores compare an individual observation (score) to the individual himself or herself, like "personal best" scores for track and field athletes, baseball statistics, and maybe golf or bowling handicaps.

Polygraph has used ipsative scoring models for decades, though polygraph examiners are not always familiar with WONK (words only nerds know) like "ipsative" that are not uncommon to people in other testing sciences. Anytime we compare a relevant question value to a comparison (probable or directed lie) question we are performing an ipsative evaluation. The way our consumers sometimes understand this, in their somewhat inaccurate, oversimplified way, is to say the relevant questions are compared to the individual's own baseline - an explanation commonly (around my parts) used by other examiners to address therapists' concerns about their clients who have a lot of "anxiety" - though those same therapists, when asked, have not documented a generalized or acute anxiety disorder, and have not documented a psychotherapeutic or psychopharmacological treatment plan for that-there "anxiety" (therapists tend to make a lot of excuses, and sometimes need guidance themselves). Anyway, what we polygraph examiners have always done is use ipsative scoring schemes, in which we compare an individual's response to a relevant question to his or her response to a PLC or DLC comparison question.

I think your questions are based on the assumption of normative decision models, which only partially apply to polygraph. Of course, ipsative scores can themselves be normed, and our scoring and decision thresholds have been subjected to research and analysis for validation.

Unlike normative testing paradigms, such as IQ and personality testing, we don't compare the individual's observed response value to normative data. We simply look to a recommended ipsative decision threshold that has been validated by some research or imparted by our fearless leaders in polygraph school.

Standardization of procedures is important, and we are not completely without it. To suggest that variability of test administration negates the possibility of mathematical validity is spurious. If that were the case, then polygraph couldn't really be regarded as science (in which case we've been pretending all along).

For ipsative scores to be evaluated against established decision thresholds, all that is necessary is for the examiner to conduct a standardized examination using valid methods and principles.

The developers of most existing computer scoring algorithms are, in my view, driven by market and proprietary motives and haven't really shown us their math. Furthermore, they do a couple of funny things now and then (like Axciton offering probability values over 100%, or Identifi offering a reliability value in place of a significance value, or Polyscore reporting a "probability of deception" where inferential statistics can only estimate the probability of a chance or erroneous result). And no one, with the exception of the appendix of Dutton's (2000, I think) article on Krapohl's OSS, and an interesting page or two in Matte's book (in which he describes asymmetrical decision thresholds and provides a graph), has really published any of their normative data. (How many of us have thought about what those numbers really mean?)

The ipsative model I have proposed mirrors existing scoring principles, and uses commonly recognized inferential statistical models to determine the level of significance. These models are not impacted by the number of relevant or comparison questions, or the number of charts, and can be applied to single issue or mixed issues tests.

In the end, we had better hope that math and statistics support the validity of our tests, or the polygraph will never be accepted as science. While I do not mean to disparage the importance of fuzzier and less measurable forms of brilliant thinking, in this country we have tended to favor data-based and mathematically supported conclusions. For example: ever wonder why Carl Jung's ideas are not more widely employed? Nobody has ever suggested Jung was wrong or off track; it's just that his ideas are not measurable, not describable in tangible terms, and not subject to repeatable experiments.

So standardization is important, but perhaps not the paint-by-numbers type of standardization in which everyone everywhere will achieve the exact same black-velvet portrait of Elvis (what's that really worth?). It's tempting for us examiners (who are all too used to telling others what the "truth" is) to indict others' work and say "not everyone does a good test." There may always be some variability in target selection, question language, and comparison question construction. All I've offered is another idea for stating in mathematical terms that "these-here relevant scores are not those-there comparison scores."

I'll update the previous posts, as the confidence interval has become less interesting, and I've constructed a spreadsheet using a common small-sample (t-distribution) hypothesis test for mixed issues tests, and a common t-test for single issue tests.


p.s. Sorry if it sounds like I'm ranting; it's late now and I've got a few more days of work before I get a break.


r


stat (Member) posted 09-04-2006 11:39 AM
Thanks, Ray, for clarifying your points with a little more "undergrad" lingo/anecdotes. I think I speak for many when I say that we suspect what you're saying is highly valuable, but we may need a translator to get us through some of your variables/structure in order to understand the formulas (bigger picture). What's great is that you yourself make a fine translator for your own statements, as proven by your last post. I'm reminded of the disconnect between hardware and software experts within the computer industry. Hardware experts need a little extra TLC when trying to grapple with relevant software applications - and vice versa. It's all good.


rnelson (Member) posted 09-04-2006 10:01 PM
I put a little more time into a spreadsheet - to do some of the more mind-numbing math and conceptual stuff automatically.

All that is necessary is to input the rank scores for each parameter of each relevant and comparison question, per Miritello (1999), and specify the desired level of alpha (essentially a confidence level or false negative rate). Alternatively, you can input the rank/proportion scores for each relevant and comparison question.

The worksheet will instantly tabulate the results for single issue and mixed issues (parsed) results. For mixed issues tests, individual relevant questions that do not produce DI/SR results will default to INC/NO whenever any one or more questions produces a DI/SR score. This way the worksheet will not attempt to parse NDI and DI results within a single mixed issues examination.

Here is a screenshot with some annotations.
http://www.raymondnelson.us/images/single_and_mixed.jpg

In this example the test is inconclusive as a single issue test at alpha .05, though it will report NDI at alpha .1 (two-tailed), which suggests that a one-tailed t-test at alpha .05 will also report an NDI result.

This is a mixed issues test on a probation revocation case, and while R1 is NDI at .05, R2 is stubbornly INC even at higher alphas (lower confidence). This is consistent with the hand-scored results.

Identify scores the test as NDI.

While this is an austere beginning, I think the proof-of-concept stage is proving interesting.


r


Barry C (Member) posted 09-05-2006 12:49 PM
Alright. I keep telling myself I've got to read through this one slowly and respond, but every time I look, it gets longer. I'll read it all eventually, but I've got to run and do some work.

Anyhow, has anyone (Ray?) run this by Bruce White at Axciton? I think he's done something similar already. He seldom publishes anything, but if you ask, he'll tell you.


rnelson (Member) posted 09-05-2006 05:02 PM
Barry,

Bruce let me bend his ear about mixed issue scoring concerns over Mongolian BBQ while he was in Colorado a couple of years ago. I know he has the White Star thing, which vaguely describes the distributions of comparison and relevant scores. In Chart Analysis, he reports a standard score (also known as a z-score) for each question, and it appears he uses a rank order scheme - probably under both Chart Analysis and White Star. However, the z-test and z-score, using the standard normal distribution, assume a large sample (N>30) and normality, and so are somewhat incorrectly applied here. The t-test and t-distribution adjust for small samples.
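
The difference is not trivial at polygraph sample sizes. A quick comparison of critical values (scipy assumed):

```python
# z vs t critical values at alpha .05 (two-tailed)
from scipy import stats

alpha = 0.05
print(stats.norm.ppf(1 - alpha / 2))     # z critical: about 1.96, regardless of N
print(stats.t.ppf(1 - alpha / 2, df=8))  # t critical for N=9: about 2.31
# the z threshold is too lenient for samples this small
```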

I think it'd be awfully nice if people'd show us their math.

Here's mine - annotated in red so you can see how it works.
http://www.raymondnelson.us/training/case9-5-06.jpg

- a worksheet on a case for which I have a subpoena for tomorrow. I won't be discussing this worksheet, but the judge/magistrate wants me to talk about why two different examiners and two different computer scores can get two different results. She's a really smart magistrate, so it should be a fun talk. I've had a few of these requests lately, so I now have a PowerPoint on polygraph accuracy and error estimation.

In this example the test data are NDI at a two-tailed alpha of .2; at higher confidence levels (lower alpha thresholds) R1 parses as INC.

and another screenshot at alpha .1
http://www.raymondnelson.us/training/case9-5-06b.jpg

The test is a mixed issues maintenance polygraph, covering the time since discharge from probation. It's a fifteen-year-old case: a removal-from-sex-offender-registry action after a babysitting offense committed when the female subject was age 11. She's been a bit reckless all along, including cheating on her husband, who is serving in the military. I don't think they'll remove her from the registry.

I also think it's interesting for us examiners to know just what data features are exploited in computer analysis.

This goes back to J.B.'s statements about the acceptability of a science based on recognized constructs, and the idea that we don't measure lies per se but rather the significance (i.e., the statistical significance) of measurable reactions that are correlated with lying. The conceptual challenge (construct challenge) is measuring the statistically significant absence of something (reactions correlated with lying). In comparison question polygraphy we have accomplished this through the evaluation of the differences between reactions to RQs and reactions to CQs.

I think it's an interesting distraction from paid work to try to put this together in a way that can be easily explained to people who are knowledgeable about statistics and measurement.

One of the criticisms, I recall, from the NAS was the "black-box" mentality surrounding computer scoring algorithms. Sure, they may work OK, but it's always better science to understand the theory and principles that account for why they work. When the NAS and others suggested that polygraph lacks validity, I believe they were referring to construct validity (basic science and measurement) problems such as this. The NAS also had quite a few favorable things to say, and I always use the report as a reference when training other professionals. I think it helps inoculate those professionals against offenders' and other anti-polygraph folks' impulse to use the NAS report to disparage the polygraph test.


r


rnelson (Member) posted 09-22-2006 09:35 AM
It's been a couple of late nights, but I now have a spreadsheet to do significance testing of the difference between relevant and comparison question scores, using measurement data from hand scoring or measurements obtained from computerized procedures.

I've also created a method for testing the significance of hand-scored data, by interpolating the values of comparison questions from the relevant question scores - it's interesting.

The spreadsheet uses a standard t-test procedure for small samples to evaluate the difference between the mean values of comparison and relevant questions. The significance of individual questions on mixed issues tests is evaluated using a standard hypothesis test for small samples, using each relevant question as a point estimate against the comparison mean. Significance can be tested at varied or selectable levels of alpha, and the p-values of the overall question set and individual questions are reported as the lowest level of alpha at which differences can be accepted as significant.
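
As a sketch of the overall-test portion, here is a standard two-sample t-test standing in for the spreadsheet's procedure (the scores are hypothetical):

```python
# Two-sample t-test of comparison vs relevant question scores; the p-value
# is the lowest alpha at which the difference counts as significant
from scipy import stats

cq_scores = [5, 6, 4, 5, 6, 5, 4, 6, 5]  # hypothetical CQ iterations
rq_scores = [2, 3, 2, 1, 3, 2, 2, 1, 3]  # hypothetical RQ iterations

t_stat, p_value = stats.ttest_ind(cq_scores, rq_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```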

The spreadsheet can accommodate up to five relevant questions, and can handle either hand-scored (7- or 3-position) data or measurements derived from computer or hand-scoring procedures. If desired, an entire set of irrelevant (Ir), symptomatic (Sy), CQ, and RQ questions can be entered, and it will score only the CQs and RQs.

The statistical methods are robust with any number of RQs and CQs (though a minimum of two of each is required) and should work with any established technique. It is also robust with three to five charts - just input the data.

When parsing individual questions, the spreadsheet will not attempt to report split calls. The overall test result is based on all RQs, and the mixed-issues parsing routines will default to INC all questions that do not meet the alpha threshold for an SR/DI result whenever any one or more questions does produce an SR/DI result.

The spreadsheet can maintain a database of scored exams for later review.

Now its time for some testing.

Here is a single confirmed case example.

handscores http://www.raymondnelson.us/qc/handscore_9-22-06.pdf

and computer scores derived from
http://www.raymondnelson.us/qc/measurement_score_9-22-06.pdf

I know, it's just one confirmed case.

But it was fun.

r

